Prepare Data
============

On this page, we introduce the functions we provide to load datasets and split given data.

Load Data
---------

In ``s3l.datasets.base``, we provide some useful functions to load data. Here is the list:

::

    'load_data', 'load_dataset', 'load_graph', 'load_boston',
    'load_diabetes', 'load_digits', 'load_iris', 'load_breast_cancer',
    'load_linnerud', 'load_wine', 'load_ionosphere', 'load_australian',
    'load_bupa', 'load_haberman', 'load_vehicle', 'load_covtype',
    'load_housing10', 'load_spambase', 'load_house', 'load_clean1'

Among them, ``load_data``, ``load_dataset`` and ``load_graph`` can be used to load data you have prepared yourself. The other functions load built-in datasets that are commonly used by researchers. These functions return the data in a form that estimators can use directly. For example,

.. code:: python

    X, y = load_XXX(return_X_y=False)  # XXX is the name of dataset

We'll show you how to use the three user-oriented functions ``load_data``, ``load_dataset`` and ``load_graph``. ``load_dataset`` is called directly in the experiment classes; you can use these functions when you try algorithms outside an experiment class or when you're implementing your own experiment class.

``load_data`` loads the features and labels of a dataset given the file names.

.. code:: python

    X, y = load_data(feature_file, label_file)

``load_dataset`` wraps ``load_data`` with an additional parameter *name* and loads the built-in dataset if *name* matches one.

.. code:: python

    X, y = load_dataset(name, feature_file, label_file)

``load_graph`` loads the graph stored in a ``*.csv/npz/mat`` file and returns a matrix.

.. code:: python

    W = load_graph(graph_file)

Split Data
----------

In ``s3l.datasets.data_manipulate``, we provide some useful functions to split data.
Here is the list:

::

    'inductive_split', 'ratio_split', 'cv_split'

Among them, ``inductive_split`` splits the dataset into three parts: a labeled set, an unlabeled set and a testing set, which is helpful for semi-supervised learning tasks.

.. code:: python

    from sklearn.datasets import make_classification
    from s3l.datasets import data_manipulate

    X, y = make_classification()
    train_idx, test_idx, label_idx, unlabel_idx = \
        data_manipulate.inductive_split(X, y, test_ratio=0.3,
                                        initial_label_rate=0.05,
                                        split_count=10)

``ratio_split`` and ``cv_split`` split the given data based on a train/test ratio and on k-fold cross-validation, respectively.

.. code:: python

    from sklearn.datasets import make_classification
    from s3l.datasets import data_manipulate

    X, y = make_classification()

    # ratio_split
    train_idx, test_idx = \
        data_manipulate.ratio_split(X, y, unlabel_ratio=0.3, split_count=10)

    # cv_split
    train_idx, test_idx = \
        data_manipulate.cv_split(X, y, k=3, split_count=10)

The returned ``XXX_idx`` values are lists of indexes that can be used directly by the built-in estimators.
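To make the three-way partition concrete, here is a minimal standard-library sketch of what an inductive split does conceptually. It is not the s3l implementation: the helper name ``sketch_inductive_split``, its ``n_samples`` and ``seed`` parameters, and the choice to apply ``initial_label_rate`` to the training portion (rather than to the whole dataset) are assumptions made for illustration only.

```python
import random


def sketch_inductive_split(n_samples, test_ratio=0.3, initial_label_rate=0.05,
                           split_count=10, seed=0):
    """Hypothetical helper illustrating an inductive split (not s3l code).

    Returns four lists, each holding one list of indexes per split.
    """
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_ratio))
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        # Shuffle all sample indexes, then cut off the testing part.
        perm = list(range(n_samples))
        rng.shuffle(perm)
        test, train = perm[:n_test], perm[n_test:]
        # Assumption: the label rate is applied to the training part,
        # and at least one sample is always labeled.
        n_label = max(1, int(round(len(train) * initial_label_rate)))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])
        unlabel_idx.append(train[n_label:])
    return train_idx, test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = sketch_inductive_split(100)
```

In this sketch, each of the four return values contains ``split_count`` index lists; within one split, the labeled and unlabeled indexes together make up the training indexes, and the training and testing indexes never overlap.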